
x86 CLMUL CRC rewrite #127

Merged: 10 commits from crc_clmul_rewrite into master on Jun 17, 2024
Conversation

@Larhzu Larhzu (Member) commented Jun 11, 2024

It's faster with both tiny and large buffers and doesn't require disabling any sanitizers.

With large buffers the extra speed comes from folding chunks in parallel. Faster versions exist for large buffers, but they offer diminishing returns in the context of XZ Utils, so wider folding and AVX2 code are skipped at least for now.
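For illustration, here is a minimal sketch of the core CLMUL fold step this kind of code is built on; it is not the code added in this PR, and the names fold_16 and fold_const are made up for the example. fold_const is assumed to hold x^a mod P and x^b mod P in its two 64-bit halves, and the intrinsic requires PCLMULQDQ support (e.g. -mpclmul with GCC/Clang). Running several such states side by side is what the parallel folding refers to.

```c
#include <immintrin.h>

/* One fold step: multiply the two 64-bit halves of the 16-byte state by
 * the matching halves of the precomputed constant (carry-less multiply),
 * then XOR in the next 16 bytes of input. */
static inline __m128i
fold_16(__m128i state, __m128i fold_const, __m128i next_chunk)
{
	__m128i lo = _mm_clmulepi64_si128(state, fold_const, 0x00);
	__m128i hi = _mm_clmulepi64_si128(state, fold_const, 0x11);
	return _mm_xor_si128(_mm_xor_si128(lo, hi), next_chunk);
}
```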

Larhzu added 8 commits June 16, 2024 12:56
It's not enough to silence the address sanitizer. Also memory and
thread sanitizers would need to be silenced. They, at least currently,
aren't smart enough to see that the extra bytes are discarded from
the xmm registers by later instructions.

Valgrind is smarter, possibly because this kind of code isn't weird
to write in assembly. Agner Fog's optimizing_assembly.pdf even mentions
this idea of doing an aligned read and then discarding the extra
bytes. The sanitizers don't instrument assembly code but Valgrind
checks all code.

It's better to change the implementation to avoid the sanitization
attributes, which also look scary in the code. (Somehow they can look
scarier than __asm__, which is implicitly unsanitized.)

See also:
#112
#122
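Roughly, the pattern being replaced looks like the following hedged illustration; this is not the removed xz code, and load_first_bytes_aligned is a made-up name. The aligned load never crosses a 16-byte (and thus never a page) boundary, and later instructions discard the extra bytes, but the sanitizers still flag the out-of-bounds read unless suppressed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Load the 16-byte aligned vector containing buf[0], even if that means
 * touching a few bytes before buf that are outside the caller's buffer. */
static __m128i
load_first_bytes_aligned(const uint8_t *buf)
{
	const __m128i *aligned =
		(const __m128i *)((uintptr_t)buf & ~(uintptr_t)15);
	return _mm_load_si128(aligned);
	/* ...subsequent shifts/masks keep only the bytes inside the buffer. */
}
```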
It's a standalone program that prints the required constants.
It won't be part of the normal build of the package.
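As a rough sketch of what such a generator does (this is not crc_clmul_consts_gen.c itself, and the real generator also deals with bit reflection and the constants for the final reduction), the folding constants are powers of x reduced modulo the CRC polynomial P(x) over GF(2):

```c
#include <stdint.h>
#include <stdio.h>

/* Compute x^e mod P(x) over GF(2). "poly" is P(x) with the implicit
 * x^64 term omitted. Each loop iteration multiplies by x and reduces
 * whenever the x^64 coefficient would become one. */
static uint64_t
xpow_mod(uint64_t poly, unsigned e)
{
	uint64_t r = 1; /* the polynomial x^0 */
	while (e-- > 0) {
		uint64_t carry = r >> 63;
		r <<= 1;
		if (carry)
			r ^= poly;
	}
	return r;
}

int
main(void)
{
	/* CRC-64-ECMA polynomial used by XZ Utils, non-reflected form. */
	const uint64_t poly = 0x42F0E1EBA9EA3693;
	printf("x^128 mod P = 0x%016llX\n",
	       (unsigned long long)xpow_mod(poly, 128));
	return 0;
}
```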
Now it refers to crc_clmul_consts_gen.c. vfold8 was renamed to mu_p,
and p no longer has its lowest bit set (this makes no difference, as
the output bits it affects are ignored).
By using modulus-scaled constants, the final reduction can
be simplified.
This way it's clearer that two things cannot be selected
at the same time.
@Larhzu Larhzu force-pushed the crc_clmul_rewrite branch from 8931e99 to 888cef9 on June 16, 2024 10:30
Larhzu added 2 commits June 17, 2024 15:00
It's faster with both tiny and large buffers and doesn't require
disabling any sanitizers. With large buffers the extra speed is
from folding four 16-byte chunks in parallel.

The 32-bit x86 with MSVC reportedly still needs a workaround.
Now the simpler "__asm mov ebx, ebx" trick is enough but it
needs to be in lzma_crc64() instead of crc64_arch_optimized().
Thanks to Iouri Kharon for testing and the fix.

Thanks to Ilya Kurdyukov for testing the speed with aligned and
unaligned buffers on a few x86 processors and on E2K v6.

Thanks to Sam James for general feedback.

Fixes: #112
Fixes: #122
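A hedged sketch of the four-way folding loop described above (not this commit's code; fold_blocks and fold_4 are made-up names): four independent 16-byte states advance through the buffer 64 bytes per iteration, so the carry-less multiplications can overlap. fold_4 is assumed to hold x^(512+64) mod P and x^512 mod P, matching the 64-byte stride; the exact constants depend on the reduction convention used.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void
fold_blocks(__m128i s[4], const uint8_t *buf, size_t nblocks, __m128i fold_4)
{
	while (nblocks-- > 0) {
		for (int i = 0; i < 4; ++i) {
			__m128i next = _mm_loadu_si128(
					(const __m128i *)(buf + 16 * i));
			__m128i lo = _mm_clmulepi64_si128(s[i], fold_4, 0x00);
			__m128i hi = _mm_clmulepi64_si128(s[i], fold_4, 0x11);
			s[i] = _mm_xor_si128(_mm_xor_si128(lo, hi), next);
		}
		buf += 64;
	}
	/* Afterwards the four states are folded into one and reduced
	 * to the final CRC value. */
}
```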
On E2K the function compiles only due to compiler emulation but the
function is never used. It's cleaner to omit the function when it's
not needed even though it's a "static inline" function.

Thanks to Ilya Kurdyukov.
@Larhzu Larhzu force-pushed the crc_clmul_rewrite branch from 3427b26 to 30a2d5d on June 17, 2024 12:03
@Larhzu Larhzu merged commit 30a2d5d into master Jun 17, 2024
8 checks passed
@Larhzu Larhzu deleted the crc_clmul_rewrite branch June 17, 2024 15:29